Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?

Authors

Abstract

Self-supervised learning has attracted plenty of recent research interest. However, most works on self-supervision in speech are unimodal, and there has been limited work that studies the interaction between the audio and visual modalities for cross-modal self-supervision. This article (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes an audio-only self-supervision approach for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining can outperform fully supervised training and is especially useful to prevent overfitting on smaller sized datasets. We evaluate our learned audio representations on discrete emotion recognition, continuous affect recognition and automatic speech recognition. The proposed approach outperforms existing self-supervised methods on all tested downstream tasks. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative representations.

Similar resources

Learning Corpus-Invariant Discriminant Feature Representations for Speech Emotion Recognition

As a hot topic of speech signal processing, speech emotion recognition methods have been developed rapidly in recent years. Some satisfactory results have been achieved. However, it should be noted that most of these methods are trained and evaluated on the same corpus. In reality, the training data and testing data are often collected from different corpora, and the feature distributions of di...

Learning Spontaneity to Improve Emotion Recognition In Speech

We investigate the effect and usefulness of spontaneity in speech (i.e. whether a given speech data is spontaneous or not) in the context of emotion recognition. We hypothesize that emotional content in speech is interrelated with its spontaneity, and thus propose to use spontaneity classification as an auxiliary task to the problem of emotion recognition. We propose two supervised learning set...

Active learning for dimensional speech emotion recognition

State-of-the-art dimensional speech emotion recognition systems are trained using continuously labelled instances. The data labelling process is labour intensive and time-consuming. In this paper, we propose to apply active learning to reduce according efforts: The unlabelled instances are evaluated automatically, and only the most informative ones are intelligently picked by an informativeness...

Feature Transfer Learning for Speech Emotion Recognition

Speech Emotion Recognition (SER) has achieved some substantial progress in the past few decades since the dawn of emotion and speech research. In many aspects, various research efforts have been made in an attempt to achieve human-like emotion recognition performance in real-life settings. However, with the availability of speech data obtained from different devices and varied acquisition condi...

Representation Learning for Speech Emotion Recognition

Speech emotion recognition is an important problem with applications as varied as human-computer interfaces and affective computing. Previous approaches to emotion recognition have mostly focused on extraction of carefully engineered features and have trained simple classifiers for the emotion task. There has been limited effort at representation learning for affect recognition, where features ...

Journal

Journal title: IEEE Transactions on Affective Computing

Year: 2023

ISSN: 1949-3045, 2371-9850

DOI: https://doi.org/10.1109/taffc.2021.3062406